ABSTRACT

We consider the problem of uniformly distributing the load of a parallel program 
over a multiprocessor system. We analyze a program whose structure permits the 
computation of the optimal static solution. We then describe four strategies for load 
balancing and compare their performance. 

The four strategies are (1) the optimal static assignment algorithm, which is
guaranteed to yield the best static solution, (2) the static binary dissection method, which
is very fast but sub-optimal, (3) the greedy algorithm, a static fully polynomial time
approximation scheme, which estimates the optimal solution to arbitrary accuracy, and
(4) the predictive dynamic load balancing heuristic, which uses information on the
precedence relationships within the program and outperforms any of the static methods.

It is also shown that the overhead incurred by the dynamic heuristic (4) is reduced
considerably if it is started off with a static assignment provided by any of (1), (2) or (3).


Supported by NASA Contracts NAS1-17070 and NAS1-18107 while the authors were in residence at 
the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research 
Center. 
1. Introduction

Efficient utilization of parallel computer systems requires that the task or job being 
executed be partitioned over the system in an optimal or near optimal fashion. In the 
general partitioning problem, one is given a multicomputer system with a specific inter- 
connection pattern as well as a parallel task or job composed of modules that communi- 
cate with each other in a specified pattern. One is required to assign the modules to the 
processors in such a way that the total execution time of the job is minimized. 

An assignment is said to be static if modules stay on the processors to which they 
are assigned for the lifetime of the program. A dynamic assignment, on the other hand, 
moves modules between processors from time to time whenever this leads to improved 
efficiency. 

Given an arbitrarily interconnected multicomputer system and an arbitrarily inter- 
connected parallel task, the problem of finding the optimal static partition is very 
difficult and can be shown to be computationally equivalent to the notoriously intract- 
able NP-Complete problems [1]. However, many practical problems have special struc- 
ture that permits the optimal solution to be found very efficiently. 

In this paper we will compare the performance obtained through the use of a 
dynamic load balancing method, a suboptimal but very inexpensive static load balancing 
method and the optimal static load balancing on a problem with a structure that per- 
mits the computation of the optimal balance. We also consider a fully polynomial time 
approximation scheme, the solution of which can be made to approach the optimal load 
balance. These methods for balancing load are suitable for distinct but overlapping 
varieties of problems. These problems can arise, among other places, in the solution of 
systems of linear equations using point or block iterative methods, in problems of adap- 
tive mesh refinements, as well as in time driven discrete event simulation. We describe 
our experience with four different algorithms that we have used to solve a problem for 
which all these methods are applicable. 

The first method finds the optimal static assignment using the bottleneck path algo- 
rithm described in [2]. This algorithm captures the execution costs of the modules or 
processes of the task as edge weights in an assignment graph. A minimum bottleneck 
path in this graph then yields the optimal assignment. This algorithm has moderate 
complexity and is guaranteed to yield the optimal static assignment. 

The second method that we evaluate is the binary dissection algorithm which is 
derived from the work of Berger and Bokhari [3], [4]. This algorithm is very fast but does 
not always yield the optimal static solution. 

The third scheme that we consider is based on a widely used greedy method 
described in [5], which when combined with a binary search yields an approximate solu- 
tion to the static partitioning problem. 

Finally we evaluate the predictive dynamic load balancing method developed by
Saltz [6]. This is a dynamic algorithm in that modules are reassigned from time to time
during the course of execution of the parallel program. This heuristic takes the precedence
relationships of the subtasks into account when deciding whether and when to
relocate modules. This additional information and the capability to relocate dynamically
usually permit this algorithm to outperform the optimal static algorithm.

The following section discusses in detail the problem addressed in this research. 
Section 3 contains a brief description of the optimal static algorithm. In Section 4 we 
describe the binary dissection algorithm. The greedy algorithm is described in Section 5.
Section 6 contains a description of the heuristic dynamic algorithm and Section 7 com- 
pares the performance of these four algorithms. 


2. Formulation of Problem 

We consider the partitioning on a multiprocessor system of a problem which is com- 
posed of a number of processes or modules with a predictable, repetitive pattern of 
inter-module data dependencies. The computation is divided into steps, and each 
module requires data from a set of other modules at step s-1 to begin the computations 
required for step s.

Problems that exhibit this pattern of data dependence include explicit schemes for 
solving partial differential equations [7], iterative and block iterative methods such as 
Jacobi and multicolor SOR for the solution of systems of linear equations [8] & [9], and 
problems in discrete event simulation [10] and time driven discrete event simulation*. 

The importance of good load balancing strategies is accentuated when the work 
involved in solving a problem separates naturally into a number of subunits that is rela- 
tively small compared to the number of processors utilized, and when partitioning any 
one of these subunits across several processors is inconvenient or expensive.

For example, consider the solution of an elliptic partial differential equation through 
the use of a block iterative method. The factored submatrices that represent portions of 
the domain of the partial differential equation are used repeatedly to iteratively improve 
an approximate solution of the equation. The computations that must be performed 
using each factored submatrix are forward and back substitution. If there are more fac- 
tored submatrices than processors, it may be computationally more efficient not to 
spread the forward and back substitutions across processors. If the work required to 
iterate using the factored submatrices cannot be evenly divided amongst the processors, 
dynamic balancing of load may be useful in preventing processors from becoming idle 
due to load imbalances. 

Dynamic load balancing becomes particularly desirable in problems in which the 
time needed for a process to complete one step is difficult to determine before the prob- 
lem is mapped onto a machine, or when the time required to complete a step changes 
during the problem’s execution. 

Consider the simulation of physical processes, either by means of solving a partial 
differential equation or by means of a discrete event simulation. The computations relat- 
ing to a particular spatial region may be assigned to a specific process which handles all 
computations describing events occurring in that region. In the case of discrete event 
simulations and methods that solve time dependent partial differential equations using 
an adaptive grid as part of an explicit timestepping scheme, the activity in a given
region may vary during the course of the solution of the problem. 

In this paper a method for dynamic load balancing that exploits the repetitive pat- 
tern of data dependencies is presented, and is compared with two static load balancing 
methods. The first finds the optimal solution exactly using the computationally expen- 
sive optimal algorithm or approximately by means of the greedy algorithm and the 
second is an inexpensive heuristic. 

The static load balancing methods yield a mapping of modules to processors. The 
time required to complete a problem is determined by the processor with the heaviest 
load. With the dynamic load balancing method, each module may proceed at a rate 
constrained only by the local availability of computational resources and its data depen- 
dence on other modules. Load balancing is performed in a way that is explicitly 
designed to prevent processor inactivity due to a lack of data availability. 

The performance of the dynamic load balancing method may be expected to depend 
to some extent on the initial balance of load at the time dynamic load balancing is ini- 
tiated. One would expect the performance of the dynamic load balancing method to be 
favorably influenced by the use of static load balancing to improve the initial load bal- 
ance. 

3. The Optimal Static Algorithm 

In this section we discuss briefly Bokhari’s algorithm for optimally partitioning a 
chain structured parallel or pipelined program over a chain of processors [2]. We assume 
that a chain structured program is made up of m modules numbered 1..m and has an
intercommunication pattern such that module i can communicate only with modules
i+1 and i-1 as shown in Fig. 1. Similarly, we assume that the multiprocessor of size
n<m also has a chain like architecture. We work under the constraint that each proces- 
sor has a contiguous subchain of program modules assigned to it. Thus the partitions of 
the chains have to be such that modules i and i+1 are assigned to the same or adjacent 
processors. This is known as the contiguity constraint. The optimal partitioning would 
then be the assignment of subchains of program modules to processors that minimizes 
the load on the most heavily loaded processor. 

The above problem is solved by first drawing a layered graph (Fig. 2) in which
every layer corresponds to a processor and the label on each node corresponds to a subchain
of modules. Every layer in this graph contains all subchains of modules, i.e. all
pairs <i,j> such that $1 \le i \le j \le m$. A node labeled <i,j> is connected to all nodes
<j+1,k> in the layer below it for all j except 1 and n. All nodes <1,i> in the first layer
are connected to node s while all nodes <i,m> in every layer are connected to node t.
Any path connecting nodes s and t corresponds to an assignment of modules to processors.
For example, the thick edges in Fig. 2 correspond to the assignment of Fig. 1.


* D. Nicol and J. Saltz, "A Statistical Methodology for the Control of Dynamic Load Balancing," to 
be published as an ICASE Report. 
Weights can now be added to the edges of this layered graph as follows. In layer k,
each edge emanating downwards from node <i,j> is first weighted with the time
required for processor k to process modules i through j, which accounts for the computation
time. Now we add the time to communicate between modules b and b+1 over the link
connecting processors k and k+1 to the weight of the edge joining node <a,b> in layer k
to node <b+1,d> in layer k+1. It is clear now that there is a path in this graph
corresponding to every possible contiguous subchain assignment and the weight of the
heaviest edge in a path corresponds to the time required by the most heavily loaded processor
to finish. Thus to find the optimal assignment, we have to find the path in the
layered graph in which the heaviest edge has minimum weight: the bottleneck path.

The bottleneck path can be found by using the following labeling procedure. Initially
all nodes are given labels $L(i)=\infty$ except in the first layer, in which all nodes are
labeled zero. Then starting at the top and working downwards we examine each edge e
emanating downwards from a layer. If this edge connects node a (above) to node b
(below) then replace L(b) by min(L(b), max(W(e), L(a))) where W(e) is the weight associated
with edge e. Once the graph has been labeled, we then find the edge incident on
node t which has maximum weight. Suppose the edge joining node <i,m> of layer k
with node t has maximum weight; this means that the bottleneck path contains
the node <i,m> of layer k and thus modules i through m are assigned to processor
k. The rest of the bottleneck path can be found in the same manner by working
upwards from layer k to the top.

The number of nodes per layer in the layered graph is $O(m^2)$ and thus the total
number of nodes in the graph is $O(m^2 n)$. The number of edges emanating from a node is
at most m, thus the total number of edges is $O(m^3 n)$. Since the labeling algorithm
looks at each edge once, the space as well as the time required by this algorithm
is $O(m^3 n)$.
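
The labeling procedure can be sketched compactly under two simplifying assumptions that are not part of the algorithm in [2]: communication costs are ignored and all processors are identical. Under those assumptions the label of a node <i,j> depends only on where the previous subchain ended, so one label per prefix length suffices. The function name `optimal_chain_partition` and the half-open index convention are illustrative choices, not the paper's.

```python
from itertools import accumulate

def optimal_chain_partition(weights, n):
    """Minimum bottleneck partition of a chain into n non-empty contiguous
    subchains (assumes 1 <= n <= len(weights), no communication cost,
    identical processors). Returns (bottleneck, [(start, end), ...]) with
    half-open, 0-based ranges."""
    m = len(weights)
    if not 1 <= n <= m:
        raise ValueError("need 1 <= n <= number of modules")
    prefix = [0.0] + list(accumulate(weights))
    INF = float("inf")
    # label[j]: best bottleneck with the first j modules already placed
    label = [0.0] + [INF] * m
    backs = []
    for _ in range(n):                      # one layer per processor
        new = [INF] * (m + 1)
        back = [-1] * (m + 1)
        for j in range(1, m + 1):
            for i in range(j):              # this processor gets modules i..j-1
                if label[i] == INF:
                    continue
                cand = max(label[i], prefix[j] - prefix[i])
                if cand < new[j]:
                    new[j], back[j] = cand, i
        label = new
        backs.append(back)
    # Recover the subchains by walking upwards from the last layer.
    cuts, j = [], m
    for back in reversed(backs):
        i = back[j]
        cuts.append((i, j))
        j = i
    cuts.reverse()
    return label[m], cuts
```

For example, `optimal_chain_partition([2, 3, 4, 1, 5], 2)` places the first three modules on one processor and the last two on the other. This condensed form runs in $O(m^2 n)$ time; restoring per-link communication costs would require the full layered graph described above.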


4. The Binary Dissection Method 

The binary dissection approach to the solution of the basic partitioning problem 
addressed in this paper is very efficient in terms of run time and gives solutions that are 
very close to optimal. This algorithm is a simplified version of the two dimensional partitioning
strategy developed by Berger and Bokhari [4], [5].

The algorithm proceeds as follows. The given chain of m modules is split up into 
two halves such that the difference of the sums of execution costs in each half is 
minimum. The two halves are then recursively subdivided as many times as desired. 
Clearly, the number of pieces into which the chain can be partitioned must be exactly $2^k$
where the integer k represents the depth of partitioning.

Thus this algorithm is useful for problems in which the number of processors is a 
power of 2. The time required by this algorithm is $O(m \log n)$ for a problem with m
modules and n processors since there can be no more than $\log n$ levels of partitioning
with each level requiring at most one access to each module weight. 
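
The recursive splitting described above can be sketched as follows. This is a minimal illustration, assuming computation weights only; the name `binary_dissection`, the use of prefix sums and the tie-breaking rule (leftmost best cut) are choices made here rather than details taken from [4], [5].

```python
from itertools import accumulate

def binary_dissection(weights, depth):
    """Split a chain into 2**depth contiguous pieces by recursive halving.
    Returns half-open, 0-based (start, end) ranges; pieces may be empty."""
    prefix = [0] + list(accumulate(weights))

    def split(lo, hi, d):
        if d == 0:
            return [(lo, hi)]
        # |left - right| = |2*prefix[c] - prefix[lo] - prefix[hi]| for cut c
        target = prefix[lo] + prefix[hi]
        cut = min(range(lo, hi + 1), key=lambda c: abs(2 * prefix[c] - target))
        return split(lo, cut, d - 1) + split(cut, hi, d - 1)

    return split(0, len(weights), depth)

def heaviest(weights, pieces):
    """Weight of the most heavily loaded piece."""
    return max(sum(weights[a:b]) for a, b in pieces)
```

With `weights = [2, 3, 4, 1, 5]`, `binary_dissection(weights, 2)` yields four pieces. Each level scans every module once, which matches the $O(m \log n)$ cost quoted above.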
At first sight this algorithm may seem capable of yielding the optimal solution.
This is not always so, as the example in Fig. 3 demonstrates. In the next paragraph we 
will find an upper bound on the difference between the optimal solution and the solution 
yielded by the binary dissection method. 

Let $W_T$ represent the sum of the weights of all m modules. A lower bound on the
weight of the heaviest subchain $W_{OPT}$ in the optimal partition will be $W_T/n$ in the
special case when all the n processors are uniformly loaded. Let us designate the weight
of the heaviest module by $w_{max}$ and the weight of the heaviest subchain assigned to a
processor using the techniques of binary dissection by $W_{MAX}$. Then whenever a chain is
divided into two parts, the maximum difference between the two halves will be bounded
by $w_{max}$. Thus if n=2 then $W_{MAX} \le W_T/2 + w_{max}/2$. Similarly if there are n processors
then an upper bound on $W_{MAX}$ will be:

$$W_{MAX} \le W_T/n + w_{max}\left(\frac{1}{2} + \frac{1}{4} + \cdots + \frac{1}{n}\right)$$

$$\le W_T/n + w_{max}(n-1)/n$$

Thus the maximum difference between $W_{MAX}$ and $W_{OPT}$ will be given by the following
equation under the assumption that m>n.

$$W_{MAX} - W_{OPT} \le w_{max}(n-1)/n \qquad (1)$$
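
As a quick check of the geometric sum behind bound (1), take n = 16, the processor count used in the experiments of Section 7. This worked instance only restates the bound; it adds no new result.

```latex
\frac{1}{2}+\frac{1}{4}+\frac{1}{8}+\frac{1}{16} = \frac{15}{16} = \frac{n-1}{n},
\qquad\text{so}\qquad
W_{MAX} - W_{OPT} \le \tfrac{15}{16}\, w_{max}.
```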


5. The Greedy Algorithm 

This algorithm is based on a greedy method, which is a widely used technique and
is applied to a variety of problems [3]. Sahni [1] has devised a polynomial time approximation
scheme to solve the knapsack problem using a greedy method while Kernighan
uses a similar approach [11] for finding optimal sequential partitions of graphs. Utilizing
this method one can devise an algorithm which works in stages, where at each stage a decision
is made regarding whether or not the next input should be included in the partially constructed
solution. If the inclusion of the next input would result in an infeasible solution
then it is not added to the partial solution. Greedy methods do not necessarily provide
optimal answers. For example consider the binpacking problem: Given a finite set
$W=\{w_1,w_2,\ldots,w_m\}$ of m different weights, find a partition of W into n disjoint subsets
$W_1,W_2,\ldots,W_n$, such that n is minimum and the sum of the weights in each subset $W_i$ is
no more than a fixed constant. The First Fit algorithm for the above problem is essentially
a greedy method in the sense that it tries to place each weight in the lowest
indexed subset as far as possible, but this does not result in the optimal solution [1]. If
however we put an extra condition on the problem that weights $w_i$ and $w_{i+1}$ are to be
placed in either the same subset or subsets $W_j$ and $W_{j+1}$ respectively then the same
greedy approach will be able to find the optimal solution.

The greedy algorithm is based on the function PROBE (described below) and takes
advantage of the fact that the weight assigned to the most heavily loaded processor in
the optimal partition lies somewhere between $W_T/n$ and $W_T/n + w_{max}$ as discussed in
the previous section. The algorithm selects a trial weight w in the above range and then
uses the function PROBE. The function PROBE(w) returns true if it is possible to partition
the chain of modules into subchains such that the weight of each subchain is less
than or equal to w (the resulting partition is called the greedy partition(w)) and false otherwise.

function PROBE(Processors[1..n], Modules[1..m], w): boolean;
begin
    i = 1; p = 1;
    while p <= n do
    begin
        j = i - 1;
        while j < m and weight of subchain Modules[i..j+1] <= w do
            j = j + 1;
        Assign the subchain Modules[i..j] to processor p;
        if j = m (all modules have been assigned) then return (true);
        i = j + 1; p = p + 1;
    end;
    return (false);
end.

The greedy algorithm then makes a binary search in the range $[W_T/n,\ W_T/n + w_{max}]$
using the above function to find the partition for which the weight of the heaviest subchain
is minimum. For each trial weight w the function PROBE has to look at each
module only once. If the above range is resolved to an accuracy of $\epsilon$ then the greedy
algorithm will find a greedy partition(w) in time $O(m \log_2(w_{max}/\epsilon))$ with
the assurance that w exceeds the weight of the heaviest subchain in the
optimal assignment by no more than $\epsilon$. It is important to note that the running time of the
greedy algorithm is proportional to $\log(w_{max}/\epsilon)$ unlike other fully polynomial time approximation schemes
in which the time complexity is polynomial in $1/\epsilon$ as described in [1].
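
A minimal Python sketch of PROBE and the surrounding binary search follows. The names `probe` and `greedy_bottleneck` and the parameter `eps` are illustrative, and the sketch assumes that using fewer than n subchains is acceptable (equivalently, that empty subchains are allowed).

```python
def probe(weights, n, w):
    """True if the chain can be cut greedily into at most n contiguous
    subchains, each of total weight <= w. Each module is examined once."""
    used, load = 1, 0.0
    for x in weights:
        if x > w:                 # a single module already exceeds w
            return False
        if load + x > w:          # close the current subchain, open the next
            used += 1
            load = 0.0
            if used > n:
                return False
        load += x
    return True

def greedy_bottleneck(weights, n, eps):
    """Binary search over [W_T/n, W_T/n + w_max] for a feasible trial
    weight within eps of the optimal bottleneck."""
    lo = sum(weights) / n          # lower bound W_T / n
    hi = lo + max(weights)         # upper bound, always feasible
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if probe(weights, n, mid):
            hi = mid
        else:
            lo = mid
    return hi
```

The returned value is always a feasible trial weight, so calling `probe` once more with it (while recording the cut points) would recover the corresponding greedy partition.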

In the following paragraphs we will prove that if there exists an assignment with 
the weight of its heaviest subchain equal to w then the procedure PROBE will always 
find that or an equivalent assignment assuming that subchains with no modules in them 
(empty subchains) are allowed. 

Definition: The weight of a partition is the weight of its heaviest subchain. 

Notation:

$\pi_{w,n}$      a partition with weight w and n subchains.

$\gamma_{w,n}$   a greedy partition with weight w and n subchains.

$\mu_{w,n,k}$    a mixed partition with weight w and n subchains in which the partition up to the
first k subchains is greedy and the remaining partition may or may not be greedy.

Observe that $\pi_{w,n} = \mu_{w,n,0}$ and $\mu_{w,n,n} = \gamma_{w,n}$.

Claim 1: $\mu_{w,n,k}$ can always be transformed into $\mu_{w,n,k+1}$.

Proof: Move the right hand partition of subchain k+1 to the right until any further
movement would cause the weight of subchain k+1 to exceed w or exhaust
the modules.

1. If this is possible without disturbing the right hand partition of subchain
k+2 then $\mu_{w,n,k}$ has been transformed into $\mu_{w,n,k+1}$ and the claim is correct.

2. If during the course of this movement the r.h. partition of subchain k+1
coincides with the r.h. partition of subchain k+2, this means that subchain
k+2 is now empty (which is permitted). Continue movement of both partitions
together, combining with any further partitions that may be encountered.
When the threshold point is reached, $\mu_{w,n,k}$ has been transformed into
$\mu_{w,n,k+1}$; one or more subchains to the right of k+1 are empty but the claim
is still correct.

Claim 2: If there exists a $\pi_{w,n}$ then there must also exist a $\gamma_{w,n}$.

Proof: Recall that $\pi_{w,n} = \mu_{w,n,0}$.

By repeatedly applying the transformation of Claim 1 we can transform:

$$\mu_{w,n,0} \rightarrow \mu_{w,n,1} \rightarrow \cdots \rightarrow \mu_{w,n,n} = \gamma_{w,n}$$

Result: If there exists an assignment of weight w then the procedure PROBE will
find that or an assignment of equal weight.

6. The Predictive Dynamic Load Balancing Method 

We assume that a computation is composed of a fixed number of computational 
processes or modules. The computation is divided into steps, and each module requires 
data from a set of other modules at step s-1 to begin the computations required for step
s. Each module may proceed at a rate constrained only by the time required for the
processor to perform the computations required by the module, the local availability of
computational resources and data dependence on other modules. Load balancing is per- 
formed in a way that is explicitly designed to prevent processor inactivity due to a lack 
of data availability. 

The potential work of a processor is defined as the amount of time that will be 
required to advance all modules in a processor as many steps as possible given the data 
currently available from other processors. The parallel efficiency of a processor may be 
defined as the percentage of time a processor spends performing the computations
required by the modules assigned to it. Transfers of modules between processors impact 
parallel efficiencies in a machine dependent way. The communication time required to 
transfer a module from one processor to another along with the degree to which that 
communication can be masked with computation are essential factors in this depen- 
dency. 

In the predictive dynamic load balancing method to be discussed here, load is 
shifted between processors in a way that attempts to equalize the potential work in each 
processor. When the potential work of a processor falls below a predetermined thres- 
hold, load balancing is considered. A module is shifted from a neighboring processor 
when the neighboring processor has stored an amount of potential work greater than or 
equal to the threshold plus a pre-determined safety factor. If more than one neighboring 
processor fits this criterion, the processor with the largest potential work contributes a 
module. 
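
The trigger rule in the preceding paragraph can be sketched as a small selection function. The name `choose_donor`, the dictionary-based representation and the parameter names are hypothetical; the text does not say which module the donor gives up, so that choice is left out of the sketch.

```python
def choose_donor(potential_work, neighbors, p, threshold, safety):
    """Return the neighbor of processor p that should contribute a module,
    or None if no load balancing step is warranted.

    potential_work: dict processor -> current potential work
    neighbors:      dict processor -> list of neighboring processors
    """
    if potential_work[p] >= threshold:
        return None                      # p still has enough work stored
    candidates = [q for q in neighbors[p]
                  if potential_work[q] >= threshold + safety]
    if not candidates:
        return None                      # no neighbor can spare a module
    # The neighbor with the largest potential work contributes.
    return max(candidates, key=lambda q: potential_work[q])
```

The safety factor keeps a donor from dropping below the threshold immediately after giving up a module, which would otherwise invite modules to bounce back and forth.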

The ability to efficiently calculate the potential work in a processor is central to the 
usefulness of this method. Simple and inexpensive methods for calculating potential 
work will now be described. The potential work stored in a processor may have to be 
calculated from scratch in some situations. When the computations involved in solving 
a problem are initiated or when modules are shifted in or out of a processor after load 
balancing, one must take into account both the pattern of data dependencies within a 
processor and the availability of data from other processors in order to calculate poten- 
tial work. Given a processor which has assigned to it a value for potential work, a 
simpler set of computations can be performed to update the value of potential work in 
response to the receipt of a new datum from another processor. 

It is useful at this point to describe in more detail the interaction between step
numbers achievable by the modules assigned to a processor and the external data available
to the processor. A linked data structure representing an undirected graph
DEPEND with weighted vertices is defined for each processor P. The vertices represent
the modules in P as well as the modules in other processors directly coupled to modules
in P. Let $z_i$, $1 \le i \le B$, represent boundary vertices and let $v_i$, $1 \le i \le I$, represent vertices
within the processor. The weight $w_i$ of each vertex $v_i$ represents the largest step reachable
by each module, given the currently available boundary information. The weight $q_i$
of each of the vertices $z_i$ represents the step of the largest available boundary variable
data for the module.

The largest step reachable by a vertex $v_i$ in the processor given currently available
boundary data is determined by adding one to the minimum of: (1) the largest steps
reachable by all internal vertices $v_j$ linked to $v_i$ and (2) the step number of the latest
available boundary data for the boundary vertices $z_l$ linked to $v_i$. The weight assigned to
$v_i$ may be written as

$$w_i = \min_{j,l}(w_j, q_l) + 1 \qquad (2)$$

where $v_j$ and $z_l$ are linked to $v_i$.
Denote the current step number of $v_i$ as $s_i$ and the time required to advance $v_i$ one
step as $t_i$. The potential work associated with P at a given point in the computations may
be written as

$$\sum_i (w_i - s_i)\, t_i \qquad (3)$$

where the sum is over all i corresponding to $v_i$ in P. For each boundary vertex $z_k$ the
graph DEPEND may be divided into equivalence classes based on the minimum number
of edges that have to be traversed to get to $z_k$. We define $r_{k,j}$ as the equivalence class of
$z_k$ to which $v_j$ belongs. Note that each internal vertex belongs to B different equivalence
classes, one corresponding to each boundary vertex $z_k$, $1 \le k \le B$. The proposition below
states a sort of superposition principle that holds for the determination of the maximum
achievable cumulative microstep for internal vertices in response to constraints arising
from boundary vertices.

Proposition: The weight of $v_i$ is given by

$$w_i = \min_{1 \le k \le B} (q_k + r_{k,i}) \qquad (4)$$


The proof is carried out by substituting the postulated solution into (2). Fix attention
on an internal vertex $v_i$. Corresponding to each $r_{k,i}$ where $r_{k,i} \ge 2$ there must be an
internal vertex $v_j$ linked to $v_i$ with $r_{k,j} = r_{k,i} - 1$. If there were not, it would not be possible
to find a shortest path from $v_i$ to $z_k$ consisting of $r_{k,i}$ edges. Moreover, there cannot
be an internal vertex $v_j$ connected to $v_i$ with $r_{k,j} < r_{k,i} - 1$; if there were, then $v_i$ would
have a shortest path to $z_k$ consisting of fewer than $r_{k,i}$ edges. Corresponding to each $r_{k,i}$
where $r_{k,i} = 1$ there is a direct edge from $v_i$ to $z_k$.

Now substituting (4) for each $v_j$ into (2) yields

$$w_i = \min_{j,l}\left[\min_{1 \le k \le B}(q_k + r_{k,j}),\ q_l\right] + 1 \qquad (5)$$

for all j,l such that $v_j$ and $z_l$ are linked to $v_i$. Equation (5) may be rewritten as

$$w_i = \min_{1 \le k \le B}\left[\min_{j,l}\left((q_k + r_{k,j}),\ q_l\right)\right] + 1$$

For each k, there exists an internal vertex $v_j$ with $r_{k,j} = r_{k,i} - 1$ connected to $v_i$ and there
cannot be a vertex $v_j$ where $r_{k,j} < r_{k,i} - 1$. Hence from (5) we obtain (6)

$$w_i = \min_{k,l}\left[q_k + (r_{k,i} - 1),\ q_l\right] + 1 \qquad (6)$$

For boundary vertices $z_l$ to which $v_i$ is directly connected, $r_{l,i} = 1$. Since all quantities
involved are positive in sign, we obtain from (6) the equation (4) for $v_i$ as desired.

We are now in a position to calculate the potential work from scratch, given values
of $s_i$ and $t_i$ corresponding to all vertices $v_i$ in P. For each $v_i$ in P one may calculate $w_i$
from (4) in O(B) operations per vertex. Since there are I vertices the calculation of
potential work from scratch requires O(IB) operations.
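
The from-scratch calculation can be sketched in Python as below. The graph representation (two adjacency dictionaries) and the function names are assumptions made for illustration. The sketch also assumes that the potential work is the sum over internal vertices of (largest reachable step minus current step) times the time per step, which is how equation (3) is read here. Unlike the O(IB) count above, the sketch recomputes the equivalence classes by breadth first search instead of taking them as given.

```python
from collections import deque

def reachable_steps(internal_adj, boundary_adj, q):
    """Weights w_i = min over boundary vertices k of (q_k + r_{k,i}),
    where r_{k,i} is the number of edges on a shortest path from v_i to z_k.

    internal_adj: dict internal vertex -> list of linked internal vertices
    boundary_adj: dict internal vertex -> list of linked boundary vertices
    q:            dict boundary vertex -> step of latest available data
    """
    w = {v: float("inf") for v in internal_adj}
    for k, q_k in q.items():
        # Internal vertices directly linked to z_k form class r = 1.
        dist = {v: 1 for v in internal_adj if k in boundary_adj.get(v, ())}
        queue = deque(dist)
        while queue:
            v = queue.popleft()
            w[v] = min(w[v], q_k + dist[v])
            for u in internal_adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
    return w

def potential_work(w, s, t):
    """Time needed to advance every module from its current step s_i to its
    largest reachable step w_i, at t_i time units per step."""
    return sum((w[v] - s[v]) * t[v] for v in w)
```

For a three-module chain a-b-c with boundary data at step 3 next to a and step 5 next to c, the weights come out as 4, 5 and 6 respectively.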
If a processor has a value of potential work assigned to it the potential work may be
updated in response to the receipt of a boundary datum. One finds the weights for each
vertex $v_i$ in P in the following way. By equation (4) incrementing the weight of a single
boundary vertex can either leave the weight of interior vertices unchanged or increase 
the weight by one unit. Moreover, only interior vertices currently constrained by the 
incremented boundary vertex will have their weights incremented. 

In response to an increment in a boundary vertex $z_k$, the weights in equivalence
classes may be adjusted in order of increasing equivalence class number with only one
pass necessary. Assume that $z_k$ has had its weight incremented from $q_k - 1$ to $q_k$. Before
$z_k$ was incremented, the constraint on the weight of vertices in equivalence class $r_k = n$
was $q_k - 1 + n$. The constraint on the weight of vertices in equivalence class $r_k = n - 1$ after
$z_k$ is incremented is $q_k + (n - 1)$. The adjusting of equivalence class $r_k = n$ will have no
effect on the adjustment of equivalence class $r_k = n - 1$.

If a vertex in equivalence class $r_k = n$ has a weight of less than $q_k + n - 1$ before
being considered for readjustment, it is not being constrained by $z_k$. Incrementing $z_k$'s
weight will consequently not affect the vertex. Since the only vertices which can possibly
have their weights incremented have weights $q_k + n - 1$, the order in which vertices in
an equivalence class are considered is unimportant.

Updating DEPEND may proceed as follows. The weight of the vertex in DEPEND
representing $z_k$ is first incremented. In a breadth first manner beginning with the vertex
representing $z_k$, DEPEND is searched for vertices whose weights must be incremented.
When a vertex v is found that does not require a weight increment, the search does not 
continue to examine other vertices linked to v. 
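
The breadth first update can be sketched as follows. The function name and the dictionary-based graph are hypothetical. As an implementation choice, the sketch decides whether a vertex must be incremented by re-evaluating rule (2) on its neighbors, rather than comparing its weight against the constraint of its equivalence class; for an increment of one unit the two tests should agree.

```python
from collections import deque

def receive_boundary_datum(internal_adj, boundary_adj, w, q, k):
    """Increment q[k] by one step and update the weights w in place.

    internal_adj: dict internal vertex -> list of linked internal vertices
    boundary_adj: dict internal vertex -> list of linked boundary vertices
    w:            dict internal vertex -> largest reachable step
    q:            dict boundary vertex -> step of latest available data
    """
    q[k] += 1
    # Start from the internal vertices directly linked to z_k.
    queue = deque(v for v in internal_adj if k in boundary_adj.get(v, ()))
    seen = set(queue)
    while queue:
        v = queue.popleft()
        limits = [w[u] for u in internal_adj[v]]
        limits += [q[z] for z in boundary_adj.get(v, ())]
        new_w = min(limits) + 1          # rule (2) applied locally
        if new_w == w[v]:
            continue                     # unchanged: do not search past v
        w[v] = new_w
        for u in internal_adj[v]:
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return w
```

Because the search stops at the first unchanged vertex on each branch, only the part of DEPEND that was actually constrained by the incremented boundary vertex is visited.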

In the model problem, the time and space requirements of this updating algorithm
are O(nm) and O(m) where n is the number of modules in the problem and
m is the number of steps over which advancement is to proceed.


7. Comparison of Results 

We have compared the performance of both the static load balancing methods and 
the predictive dynamic method through a variety of simulations. Note that with 
minimal computational effort, on a set of weights consisting of single precision floating
point numbers, the greedy approximation scheme produces a balance identical
to the optimal load balance. Thus, the performance obtained through the use of the
optimal method and the greedy approximation scheme was identical, and in this section
we shall simply refer to the performance of the optimal load balancing method. 

Static and dynamic methods can be combined; a static load balancing may be per- 
formed before beginning work on a problem, and a dynamic load balancing policy may 
be utilized once work on the problem has begun. It is found that the initial use of static 
load balancing policies can enhance the performance of the dynamic policy and that 
both the optimal and the binary dissection static load balancing methods yield rather 
comparable performance when used with the dynamic predictive load balancing method. 
Used without a dynamic load balancing method, the optimal load balance was found to
be notably superior to binary dissection, while there was hardly any difference between 
the optimal load balance and the greedy load balance on the test problems described 
here. 

We consider a system with 16 processors and a fixed number of modules. In each 
trial, random deviates representing the weights of modules are drawn from a truncated 
normal distribution. For each set of random deviates, both the optimal static load bal- 
ance and the binary dissection balance are calculated and the performance is tabulated. 
Simulations utilizing the predictive dynamic policy are also run using the same set of 
random deviates. These simulations utilize both static policies and the assignment of a 
fixed number of modules to each processor as starting conditions. Performance is meas- 
ured by calculating the average percentage of time processors are occupied advancing 
modules over the course of the simulations. Performance results are averaged over 50 
trials differing only in the values of the random deviates generated. 
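
The static part of this experiment can be sketched as follows. This is a minimal illustration under several assumptions that are not stated in the text: modules form a chain and each processor receives a contiguous block, truncation of the normal distribution is done by rejecting non-positive draws, binary dissection cuts at the most even split point, and static performance is taken to be the mean processor load divided by the largest processor load. All function names are hypothetical, and the dynamic policy is not modelled.

```python
import random
from itertools import accumulate
from typing import List, Sequence, Tuple


def truncated_normal(mean: float, sd: float, rng: random.Random) -> float:
    """Draw from a normal distribution, rejecting non-positive values
    (assumed form of truncation)."""
    while True:
        x = rng.gauss(mean, sd)
        if x > 0.0:
            return x


def optimal_bottleneck(weights: Sequence[float], n_procs: int) -> float:
    """Exact minimum, over contiguous partitions, of the largest load."""
    prefix = [0.0] + list(accumulate(weights))
    n = len(weights)
    best = [prefix[j] for j in range(n + 1)]   # first j modules on one processor
    for _ in range(n_procs - 1):               # allow one more processor per pass
        best = [min(max(best[i], prefix[j] - prefix[i]) for i in range(j + 1))
                for j in range(n + 1)]
    return best[n]


def binary_dissection(weights: Sequence[float], n_procs: int) -> List[float]:
    """Recursively halve the chain at the most even cut and return the
    processor loads; n_procs is assumed to be a power of two."""
    if n_procs == 1:
        return [sum(weights)]
    total, running = sum(weights), 0.0
    best_cut, best_gap = 0, float("inf")
    for cut in range(len(weights) + 1):
        gap = abs(total - 2.0 * running)       # |right load - left load|
        if gap < best_gap:
            best_cut, best_gap = cut, gap
        if cut < len(weights):
            running += weights[cut]
    half = n_procs // 2
    return (binary_dissection(weights[:best_cut], half)
            + binary_dissection(weights[best_cut:], half))


def run_trials(n_modules: int = 96, n_procs: int = 16, sd: float = 1.0,
               trials: int = 50, seed: int = 0) -> Tuple[float, float]:
    """Average static utilization (optimal, binary dissection) over trials."""
    rng = random.Random(seed)
    opt_sum = bd_sum = 0.0
    for _ in range(trials):
        w = [truncated_normal(1.0, sd, rng) for _ in range(n_modules)]
        total = sum(w)
        opt_sum += total / (n_procs * optimal_bottleneck(w, n_procs))
        bd_sum += total / (n_procs * max(binary_dissection(w, n_procs)))
    return opt_sum / trials, bd_sum / trials
```

For the hypothetical chain `[3, 3, 3, 1, 1, 1, 1, 1, 1, 1]` on 4 processors, this sketch gives a largest load of 5 for the exact partition and 6 for binary dissection, which shows how an even first cut can force an uneven split further down.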

In Fig. 4 and Fig. 5 the performance obtained through the use of the static and 
dynamic policies is depicted. In these figures, the performance of the policies is plotted 
against the variance of the truncated normal distributions from which the module 
weights were drawn. In the experiments depicted in the above figures, the weights for 
the modules were drawn from truncated normal distributions with variances of 0.5, 1.0 
and 2.0 and mean 1, and the problem was assumed to run for 200 steps. In Fig. 4,
64 modules were assigned to the system during each trial, while in Fig. 5, 96 modules
were assigned. In both of these cases, for all variances tested, the dynamic load
balancing method outperformed both static load balancing methods. Note, however, that
this measure of performance does not take into account the machine-dependent cost of
shifting modules between processors, a cost that is studied in more detail below.

The binary dissection static method was in all cases noticeably inferior to the optimal 
static load balance. The use of a static load balancing method initially had a relatively 
minor positive impact on performance in the experiments with 96 modules, and no dis- 
cernible impact at all in experiments with 64 modules. The performance impact of the 
initial use of a static load balancing method is quite dependent on the number of steps 
required to solve a problem. It will be seen later that for problems that continue for a 
relatively small number of steps, the initial use of a static load balancing method can 
markedly improve performance. 

In the dynamic load balancing method, the moving of modules from one processor 
to another will exact a cost that will depend on the details of the machines’ interprocessor
communication network. In Fig. 6 and Fig. 7 the average number of modules that
must be moved from one processor to a neighbor per step of the computation is plotted
against performance for a range of values of the dynamic method’s safety factor. In each 
of the two figures, the use of static load balancing does play a notable role in increasing 
performance and decreasing the frequency with which blocks have to be shifted. On each 
curve in Fig. 6 and Fig. 7 both the cost and performance were strictly decreasing
functions of the safety factor used.

The number of steps advanced is varied, and the performance and the overhead in
modules moved per step are depicted for the dynamic load balancing method in Fig. 8
and Fig. 9 respectively. In both figures, the effects of using the two static load balancing 
methods as well as using no load balancing at the beginning of the computation are 
compared. In all cases, the performance increases with the number of steps advanced. 

For problems that do not require a large number of steps, the performance obtained 
by starting out with a static load balancing method is superior to that arising from the 
dynamic load balancing method without initial static load balancing. Perhaps somewhat 
counter-intuitively, initially balancing load with binary dissection leads to better perfor- 
mance than initially performing an optimal balance for problems requiring over 10 steps. 
The optimal static load balance is not necessarily the initial load distribution that best 
allows the dynamic load balancing method to move modules so that processor idleness is 
avoided. As the number of steps increases, the performance differences obtained through
the use of different initial load distributions become less marked.

The initial use of static load balancing also leads to marked reduction in module 
transfer overhead as depicted in Fig. 9. In this figure the overhead per step generally 
increases with the number of steps. For problems with very large numbers of steps, the 
overheads for the initial load distributions all approach a single value. When no initial 
static load balancing is used in a problem that is advanced a small number of steps, 
both low performance and relatively high costs in number of modules transferred are 
incurred. Note that in Fig. 9, when initial static load balancing was not used,
the number of modules transferred reaches a local maximum for problems of 10 steps,
and then declines briefly before resuming its long-term increase. This phenomenon has
been observed in a number of similar experiments; its cause is unclear.

The performance obtained through the use of binary dissection as a static load 
balancing method was notably poorer than that produced by the optimal balance. We 
have observed in these and other experiments that initial static load balancing used 
along with the predictive dynamic load balancing method improves performance and 
reduces the frequency with which modules must be moved. The choice of method used 
to initially balance load does not appear to have a marked impact on performance or 
cost. 

8. Conclusions 

The four load balancing methods discussed in this paper each have their own dis- 
tinct advantages and disadvantages. Finding an optimal static load balancing is in gen- 
eral an NP-Complete problem unless special structure is present to permit a low order 
polynomial solution. For the test problems that we have considered, the greedy algo- 
rithm was an order of magnitude faster than the optimal load balancing algorithm and 
it provided results as good as the optimal solutions. The binary dissection method and 
the predictive dynamic load balancing algorithms are both quite useful in situations in 
which low order polynomial solutions to the optimal static load balancing problem do 
not appear to be available. The predictive dynamic load balancing method as formulated
here, however, is applicable only to algorithms with considerable regularity in subtask
precedence relations.

The experimental results presented here revealed that the predictive dynamic load 
balancing method led to processor utilizations that were consistently above those 
obtained by the optimal static load balancing method. As one would expect, the optimal
static load balancing method, in turn, consistently outperformed the binary dissection
method.

The initial partitioning of load at the point dynamic load balancing was initiated 
proved to have a marked effect on the performance of the dynamic load balancing algo- 
rithm. All three static load balancing methods used in conjunction with the dynamic 
load balancing method led to a substantial improvement in performance. The magnitude
of these effects depended on the number of steps the problem was advanced, being
most pronounced when a problem is finished after relatively few steps. It is interesting 
to note that the binary dissection algorithm appeared under some circumstances to con- 
sistently lead to results that were superior to optimal load balancing when used in con- 
junction with dynamic load balancing. 

One of the principal costs of the predictive dynamic load balancing method is 
expected to be the machine dependent cost of transferring the computational modules 
between processors. The effect of initial load distribution on this cost was examined and 
it was found that the frequency with which blocks were transferred between processors 
was markedly reduced when either form of static load balancing was initially employed. 

The initial distribution of load in a multiprocessor system is clearly an important 
determinant of the performance gains achievable by the dynamic load balancing policy; 
this initial distribution also has a strong influence on the overhead costs of the dynamic 
policy.

Fig. 3(a) A 9 module chain, with each module represented by its
execution cost, mapped onto a 4 processor chain using the Binary
Dissection method. The load on the most heavily loaded processor is
8 units.

Fig. 3(b) Under an optimal mapping of the 9 module chain on the
processor chain, the load on the most heavily loaded processor
would only be 6.

Figure 5. 16 processors, 96 modules, each trial run for 200 steps.
Module weights drawn from truncated normal distribution with
unit mean and standard deviations of either 0.5, 1.0, or 2.0,
prior to truncation.

Figure 6. 16 processors, 96 modules, each trial run for 200 steps.
Module weights drawn from truncated normal with unit mean,
standard deviation of 0.5. Circled figures represent safety
factors. Optimal static load balancing and binary dissection
static load balancing performance for the same input data
included for purposes of comparison.

Figure 7. 16 processors, 96 modules, each trial run for 200 steps.
Module weights drawn from truncated normal with unit mean;
standard deviation of normal distribution is 2.0. Circled
figures represent safety factors. Optimal static load
balancing and binary dissection static load balancing
performance for the same input data included for purposes of
comparison.

Figure 8. 16 processors, 96 modules, each trial run for 5, 10, 20, 50,
100, 200, 400, 800 steps. Module weights drawn from truncated
normal with unit mean, standard deviation of 1.0.

Figure 9. 16 processors, 96 modules, each trial run for 5, 10, 20, 50,
100, 200, 400, 800 steps. Module weights drawn from truncated
normal with unit mean, standard deviation of 1.0.